{
"cells": [
{
"cell_type": "markdown",
"id": "immune-sauce",
"metadata": {},
"source": [
"# Chapter 7: Data Organization Methods\n"
]
},
{
"cell_type": "markdown",
"id": "useful-pakistan",
"metadata": {},
"source": [
"## Selecting, subsetting, and organizing data"
]
},
{
"cell_type": "markdown",
"id": "excessive-colony",
"metadata": {},
"source": [
"After having learned how to define and build a DataFrame, in this chapter,\n",
"we will address some of their basic features, largely applicable also to the\n",
"Series. Through a series of basic operations and examples of some concepts\n",
"related to them, we will discover how the DataFrame can be modified in its\n",
"structure and content as we wish. All operations are aimed at a common\n",
"purpose – preparing the DataFrame on which we are working and the data\n",
"within the manipulation"
]
},
{
"cell_type": "markdown",
"id": "guided-vocabulary",
"metadata": {},
"source": [
"## Learning Objectives"
]
},
{
"cell_type": "markdown",
"id": "beneficial-locking",
"metadata": {},
"source": [
"In this section, we will cover the following topics:\n",
"* Viewing\n",
"* Selection\n",
"* Filtering\n",
"* Editing\n",
"* Descriptive Statistics\n",
"* Transposition, Sorting, and Reindexing"
]
},
{
"cell_type": "markdown",
"id": "signal-spiritual",
"metadata": {},
"source": [
"After this chapter, you will be able to see and understand the\n",
"content of a DataFrame (viewing and statistics), select appropriate parts\n",
"(subsetting) or filter its elements based on appropriate conditions\n",
"(filtering), add new ones (editing), and finally modify the structure\n",
"(reindexing and transposition). All will be aimed at focusing on that part\n",
"of the DataFrame on which we will later want to operate (manipulation) –\n",
"which we will discuss in the next chapter."
]
},
{
"cell_type": "markdown",
"id": "242458f0",
"metadata": {},
"source": [
"## The head() and tail() methods"
]
},
{
"cell_type": "markdown",
"id": "fifth-milan",
"metadata": {},
"source": [
"To have a sufficient view to roughly understand the structure of a\n",
"DataFrame and get an idea of the data it contains, the Pandas library\n",
"provides us with the two methods `head()` and `tail()`. By calling these\n",
"methods on a reference DataFrame, the first five lines are printed, with the\n",
"`head()` method, or the last five lines with the `tail()` method, as follows:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "convenient-enough",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "947f6530",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" 4 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0.777040 | \n",
" 0.990383 | \n",
" 0.828056 | \n",
" 0.925998 | \n",
" 0.124980 | \n",
"
\n",
" \n",
" 1 | \n",
" 0.865153 | \n",
" 0.293211 | \n",
" 0.334427 | \n",
" 0.038179 | \n",
" 0.412453 | \n",
"
\n",
" \n",
" 2 | \n",
" 0.284549 | \n",
" 0.950626 | \n",
" 0.354305 | \n",
" 0.588351 | \n",
" 0.485069 | \n",
"
\n",
" \n",
" 3 | \n",
" 0.559972 | \n",
" 0.819743 | \n",
" 0.247472 | \n",
" 0.197295 | \n",
" 0.882691 | \n",
"
\n",
" \n",
" 4 | \n",
" 0.528820 | \n",
" 0.313048 | \n",
" 0.418015 | \n",
" 0.311277 | \n",
" 0.801971 | \n",
"
\n",
" \n",
" 5 | \n",
" 0.633016 | \n",
" 0.949739 | \n",
" 0.607891 | \n",
" 0.683648 | \n",
" 0.968143 | \n",
"
\n",
" \n",
" 6 | \n",
" 0.115978 | \n",
" 0.949826 | \n",
" 0.714890 | \n",
" 0.044798 | \n",
" 0.045687 | \n",
"
\n",
" \n",
" 7 | \n",
" 0.194906 | \n",
" 0.231508 | \n",
" 0.226611 | \n",
" 0.300160 | \n",
" 0.274679 | \n",
"
\n",
" \n",
" 8 | \n",
" 0.803793 | \n",
" 0.338849 | \n",
" 0.441591 | \n",
" 0.912185 | \n",
" 0.433098 | \n",
"
\n",
" \n",
" 9 | \n",
" 0.999957 | \n",
" 0.051630 | \n",
" 0.180992 | \n",
" 0.302551 | \n",
" 0.769070 | \n",
"
\n",
" \n",
" 10 | \n",
" 0.574207 | \n",
" 0.783791 | \n",
" 0.540075 | \n",
" 0.179174 | \n",
" 0.894027 | \n",
"
\n",
" \n",
" 11 | \n",
" 0.569649 | \n",
" 0.371187 | \n",
" 0.195506 | \n",
" 0.594435 | \n",
" 0.400661 | \n",
"
\n",
" \n",
" 12 | \n",
" 0.591930 | \n",
" 0.032947 | \n",
" 0.252379 | \n",
" 0.805511 | \n",
" 0.200745 | \n",
"
\n",
" \n",
" 13 | \n",
" 0.942114 | \n",
" 0.786830 | \n",
" 0.649482 | \n",
" 0.539290 | \n",
" 0.229414 | \n",
"
\n",
" \n",
" 14 | \n",
" 0.380885 | \n",
" 0.768157 | \n",
" 0.886803 | \n",
" 0.509483 | \n",
" 0.054171 | \n",
"
\n",
" \n",
" 15 | \n",
" 0.752660 | \n",
" 0.208055 | \n",
" 0.084177 | \n",
" 0.714361 | \n",
" 0.550432 | \n",
"
\n",
" \n",
" 16 | \n",
" 0.350011 | \n",
" 0.312253 | \n",
" 0.605291 | \n",
" 0.356329 | \n",
" 0.355951 | \n",
"
\n",
" \n",
" 17 | \n",
" 0.603843 | \n",
" 0.572656 | \n",
" 0.800068 | \n",
" 0.381215 | \n",
" 0.382544 | \n",
"
\n",
" \n",
" 18 | \n",
" 0.321610 | \n",
" 0.778591 | \n",
" 0.626828 | \n",
" 0.018112 | \n",
" 0.221135 | \n",
"
\n",
" \n",
" 19 | \n",
" 0.332763 | \n",
" 0.619369 | \n",
" 0.669151 | \n",
" 0.406251 | \n",
" 0.625315 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1 2 3 4\n",
"0 0.777040 0.990383 0.828056 0.925998 0.124980\n",
"1 0.865153 0.293211 0.334427 0.038179 0.412453\n",
"2 0.284549 0.950626 0.354305 0.588351 0.485069\n",
"3 0.559972 0.819743 0.247472 0.197295 0.882691\n",
"4 0.528820 0.313048 0.418015 0.311277 0.801971\n",
"5 0.633016 0.949739 0.607891 0.683648 0.968143\n",
"6 0.115978 0.949826 0.714890 0.044798 0.045687\n",
"7 0.194906 0.231508 0.226611 0.300160 0.274679\n",
"8 0.803793 0.338849 0.441591 0.912185 0.433098\n",
"9 0.999957 0.051630 0.180992 0.302551 0.769070\n",
"10 0.574207 0.783791 0.540075 0.179174 0.894027\n",
"11 0.569649 0.371187 0.195506 0.594435 0.400661\n",
"12 0.591930 0.032947 0.252379 0.805511 0.200745\n",
"13 0.942114 0.786830 0.649482 0.539290 0.229414\n",
"14 0.380885 0.768157 0.886803 0.509483 0.054171\n",
"15 0.752660 0.208055 0.084177 0.714361 0.550432\n",
"16 0.350011 0.312253 0.605291 0.356329 0.355951\n",
"17 0.603843 0.572656 0.800068 0.381215 0.382544\n",
"18 0.321610 0.778591 0.626828 0.018112 0.221135\n",
"19 0.332763 0.619369 0.669151 0.406251 0.625315"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame(np.random.rand(100).reshape(20,5))\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "6b20ea4b",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" 4 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0.777040 | \n",
" 0.990383 | \n",
" 0.828056 | \n",
" 0.925998 | \n",
" 0.124980 | \n",
"
\n",
" \n",
" 1 | \n",
" 0.865153 | \n",
" 0.293211 | \n",
" 0.334427 | \n",
" 0.038179 | \n",
" 0.412453 | \n",
"
\n",
" \n",
" 2 | \n",
" 0.284549 | \n",
" 0.950626 | \n",
" 0.354305 | \n",
" 0.588351 | \n",
" 0.485069 | \n",
"
\n",
" \n",
" 3 | \n",
" 0.559972 | \n",
" 0.819743 | \n",
" 0.247472 | \n",
" 0.197295 | \n",
" 0.882691 | \n",
"
\n",
" \n",
" 4 | \n",
" 0.528820 | \n",
" 0.313048 | \n",
" 0.418015 | \n",
" 0.311277 | \n",
" 0.801971 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1 2 3 4\n",
"0 0.777040 0.990383 0.828056 0.925998 0.124980\n",
"1 0.865153 0.293211 0.334427 0.038179 0.412453\n",
"2 0.284549 0.950626 0.354305 0.588351 0.485069\n",
"3 0.559972 0.819743 0.247472 0.197295 0.882691\n",
"4 0.528820 0.313048 0.418015 0.311277 0.801971"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"id": "light-diary",
"metadata": {},
"source": [
"The `head()` method, if used on Jupyter\n",
"Notebook, shows the first five lines of the DataFrame with its particular\n",
"formatting seen previously, as follows:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "cab386b5",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" 4 | \n",
"
\n",
" \n",
" \n",
" \n",
" 15 | \n",
" 0.752660 | \n",
" 0.208055 | \n",
" 0.084177 | \n",
" 0.714361 | \n",
" 0.550432 | \n",
"
\n",
" \n",
" 16 | \n",
" 0.350011 | \n",
" 0.312253 | \n",
" 0.605291 | \n",
" 0.356329 | \n",
" 0.355951 | \n",
"
\n",
" \n",
" 17 | \n",
" 0.603843 | \n",
" 0.572656 | \n",
" 0.800068 | \n",
" 0.381215 | \n",
" 0.382544 | \n",
"
\n",
" \n",
" 18 | \n",
" 0.321610 | \n",
" 0.778591 | \n",
" 0.626828 | \n",
" 0.018112 | \n",
" 0.221135 | \n",
"
\n",
" \n",
" 19 | \n",
" 0.332763 | \n",
" 0.619369 | \n",
" 0.669151 | \n",
" 0.406251 | \n",
" 0.625315 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1 2 3 4\n",
"15 0.752660 0.208055 0.084177 0.714361 0.550432\n",
"16 0.350011 0.312253 0.605291 0.356329 0.355951\n",
"17 0.603843 0.572656 0.800068 0.381215 0.382544\n",
"18 0.321610 0.778591 0.626828 0.018112 0.221135\n",
"19 0.332763 0.619369 0.669151 0.406251 0.625315"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.tail()"
]
},
{
"cell_type": "markdown",
"id": "modified-chester",
"metadata": {},
"source": [
"These two methods allow us to see the first and last five lines of the\n",
"DataFrame. This default behavior can be changed by passing the number of\n",
"lines to be displayed as the argument of the two methods. For example, if\n",
"we want to see only the first three rows of the DataFrame, we will get the\n",
"results, as shown below"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "d56ce613",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" 4 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0.777040 | \n",
" 0.990383 | \n",
" 0.828056 | \n",
" 0.925998 | \n",
" 0.124980 | \n",
"
\n",
" \n",
" 1 | \n",
" 0.865153 | \n",
" 0.293211 | \n",
" 0.334427 | \n",
" 0.038179 | \n",
" 0.412453 | \n",
"
\n",
" \n",
" 2 | \n",
" 0.284549 | \n",
" 0.950626 | \n",
" 0.354305 | \n",
" 0.588351 | \n",
" 0.485069 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1 2 3 4\n",
"0 0.777040 0.990383 0.828056 0.925998 0.124980\n",
"1 0.865153 0.293211 0.334427 0.038179 0.412453\n",
"2 0.284549 0.950626 0.354305 0.588351 0.485069"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head(3)"
]
},
{
"cell_type": "markdown",
"id": "98fd2ce2",
"metadata": {},
"source": [
"## Selection"
]
},
{
"cell_type": "markdown",
"id": "direct-stupid",
"metadata": {},
"source": [
"In the previous chapter, during the creation of the DataFrame, we already\n",
"learned how to select the elements inside it. In this section, we will address\n",
"the selection in the foreground, deepening this topic which will then be the\n",
"basis of all the other features present in this and subsequent chapters. It is,\n",
"therefore, crucial to start with this topic to better understand the concepts,\n",
"and then to address all the subsequent topics in this book with greater\n",
"clarity."
]
},
{
"cell_type": "markdown",
"id": "anonymous-antibody",
"metadata": {},
"source": [
"## Subsetting by index or by position"
]
},
{
"cell_type": "markdown",
"id": "available-finland",
"metadata": {},
"source": [
"\n",
"Structured objects such as DataFrame and Series, have been created\n",
"specifically to carry out the selection process on them in a simple and\n",
"efficient way. Within these objects, each dimension is characterized by an\n",
"Index object, the purpose of which is to label the individual elements on\n",
"each of them. It is precisely the Index objects that differentiate DataFrame\n",
"and Series from the n-dimensional arrays (ndarray) of NumPy. On the latter,\n",
"since they have no index objects, the positions of the elements are specified\n",
"by means of a sequence of progressive integers from 0 to N-1 (where N is\n",
"the length of the dimension).\n",
"Since the Pandas library is based on the NumPy library, and in its way,\n",
"extends some of its features, it wanted to keep both the possibility of being\n",
"able to select the elements of the DataFrame and Series with the numbers of\n",
"the positions. Both extend this concept through the Indexes and their labels.\n",
"Subsetting can then be done via the following:\n",
"\n",
"* **Indexing:** Users can select the data using the labels that distinguish\n",
"the individual rows and columns, through the Index objects.\n",
"* **Integer Positioning:** Users can select data using positions for integers\n",
"of rows and columns, such as NumPy ndarrays."
]
},
{
"cell_type": "markdown",
"id": "asian-reach",
"metadata": {},
"source": [
"When you want to select a single element, things are quite simple – you\n",
"specify the label corresponding to the row and the corresponding column\n",
"(indexing) or the position within the DataFrame by means of the pair of\n",
"integers (integer positioning) – as shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "3e409e98",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 1 | \n",
" 0 | \n",
" 5 | \n",
" 6 | \n",
"
\n",
" \n",
" b | \n",
" 6 | \n",
" 5 | \n",
" 2 | \n",
" 5 | \n",
"
\n",
" \n",
" c | \n",
" 8 | \n",
" 3 | \n",
" 2 | \n",
" 6 | \n",
"
\n",
" \n",
" d | \n",
" 5 | \n",
" 2 | \n",
" 3 | \n",
" 6 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C D\n",
"a 1 0 5 6\n",
"b 6 5 2 5\n",
"c 8 3 2 6\n",
"d 5 2 3 6"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame(np.random.randint(0,9,16).reshape(4,4),\n",
" index = ['a','b','c','d'],\n",
" columns = ['A','B','C','D'])\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "ec00d99c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1\n",
"1\n"
]
}
],
"source": [
"print(df.at['a','A']) #indexing\n",
"print(df.iat[0,0]) #integer positioning"
]
},
{
"cell_type": "markdown",
"id": "electoral-phoenix",
"metadata": {},
"source": [
"But most of the time, we need to select certain subsets of the DataFrame\n",
"and not a single element. So, we will have to extend the single values\n",
"(integer or label) with words that somehow express a particular subset on\n",
"which to make the selection."
]
},
{
"cell_type": "markdown",
"id": "aerial-polymer",
"metadata": {},
"source": [
"Selection operators and indexers accept the following terms in this regard:\n",
"* Lists (integer or labels)\n",
"* Slices (slicing)"
]
},
{
"cell_type": "markdown",
"id": "biblical-keeping",
"metadata": {},
"source": [
"**Lists** specify all the columns or rows we want to include in the subsetting.\n",
"As the name implies, they are simple Python lists containing integer values\n",
"or labels corresponding to the rows or columns to be selected.\n",
"\n",
"**Slices,** on the other hand, specify ranges of indexes, thus, including in a\n",
"single wording, a set of contiguous columns or rows, without having to\n",
"specify them individually. To define the slices, the slicing techniques are\n",
"used with the ‘:’ operator (the same way as in NumPy with ndarray)."
]
},
{
"cell_type": "markdown",
"id": "tamil-tackle",
"metadata": {},
"source": [
"## loc[ ] – Selection by Labels"
]
},
{
"cell_type": "markdown",
"id": "fundamental-specialist",
"metadata": {},
"source": [
"\n",
"The `loc[ ]` operator focuses on selection based on the labels contained\n",
"within the Index objects, as shown as follows:\n",
"\n",
"`loc[ row labels, column labels]`\n",
"\n",
"If you want to select a single element within the DataFrame, you will first\n",
"specify the row label and then that of the column, as shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "altered-layout",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0\n"
]
}
],
"source": [
"print(df.loc['a','B'])"
]
},
{
"cell_type": "markdown",
"id": "billion-breach",
"metadata": {},
"source": [
"A single element as a scalar is obtained as a return value.\n",
"Whereas, if you want to select a subset of adjacent elements, you can use\n",
"the slicing technique with labels, as shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "9506007c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" A B C\n",
"b 0 2 1\n",
"c 5 1 0\n",
"d 8 7 6\n"
]
}
],
"source": [
"print(df.loc['b':'d','A':'C']) "
]
},
{
"cell_type": "markdown",
"id": "greenhouse-expression",
"metadata": {},
"source": [
"In slicing, all the elements included between the extremes expressed by the\n",
"two labels are selected, including these. A smaller DataFrame is obtained as\n",
"a return value.\n",
"If, on the other hand, you want to select the entire range of columns or\n",
"rows, just write the slicing operator :, as shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "4e30f10d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" A B C D\n",
"a 7 7 2 3\n",
"b 0 2 1 3\n",
"c 5 1 0 8\n"
]
}
],
"source": [
"print(df.loc['a':'c', :]) "
]
},
{
"cell_type": "markdown",
"id": "mental-nation",
"metadata": {},
"source": [
">#### Exercise : \n",
"Try the following\n",
"1. `df.loc[['a','c'], ['A','D']]`\n",
"\n",
"2. `df.loc['a':'c', 'A':'B'].join(df.loc['a':'c', 'D'])`\n",
"\n",
"3. `df.loc[:,['A','B','D']].loc['a':'c', :]`\n"
]
},
{
"cell_type": "markdown",
"id": "artificial-generic",
"metadata": {},
"source": [
"## iloc[ ] – Selection by location\n"
]
},
{
"cell_type": "markdown",
"id": "incorporated-remains",
"metadata": {},
"source": [
"The `iloc[ ]` operator works with the same rules as the selection operator []\n",
"for lists in Python. That is, it focuses on the positions of the elements\n",
"expressed by sequences of integers from 0 to N-1 with N as the length of the\n",
"Index, as shown as follows:"
]
},
{
"cell_type": "markdown",
"id": "partial-worcester",
"metadata": {},
"source": [
"`iloc[ row number, column number]`"
]
},
{
"cell_type": "markdown",
"id": "broad-seven",
"metadata": {},
"source": [
"So, if we want to select a single element, it will be enough to add its\n",
"numerical coordinates of the corresponding position, as shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "5e716bbc",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"5"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.iloc[1,3]"
]
},
{
"cell_type": "markdown",
"id": "general-script",
"metadata": {},
"source": [
"Whereas, if we want to select a subset of the DataFrame, we can use the\n",
"slicing technique by specifying the ranges of the numerical indices if they\n",
"are adjacent, as shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "2c4154cf",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 0 | \n",
" 5 | \n",
"
\n",
" \n",
" b | \n",
" 5 | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" B C\n",
"a 0 5\n",
"b 5 2"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.iloc[0:2,1:3]"
]
},
{
"cell_type": "markdown",
"id": "younger-crystal",
"metadata": {},
"source": [
"And if the elements are not adjacent, we can replace slices with lists, as\n",
"shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "e5bc6cae",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" b | \n",
" 6 | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A C\n",
"b 6 2"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.iloc[[1],[0,2]]"
]
},
{
"cell_type": "markdown",
"id": "legislative-breach",
"metadata": {},
"source": [
"### Selection on dtype\n",
"So far, we have seen that the selections can be made by subsetting by\n",
"exploiting the indexing of the DataFrame. But there are also other selection\n",
"methods, for instance, based on the values contained in the DataFrame. For\n",
"example, selections can be made based on the type of data (`dtype`).\n",
"To make this type of selection, we do not use an operator, but a method\n",
"called `select_dtypes()`. This method accepts the following two possible\n",
"parameters:\n",
"\n",
"* include\n",
"* exclude\n",
"\n",
"A list containing the types of data (dtype) to be included or excluded in the\n",
"selection will be passed on to these parameters. In this regard, we create a\n",
"DataFrame containing different types of data for each column, as shown as\n",
"follows:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "df73cd5d",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 6 | \n",
" 0.3 | \n",
" one | \n",
" True | \n",
"
\n",
" \n",
" b | \n",
" 2 | \n",
" 5.1 | \n",
" two | \n",
" False | \n",
"
\n",
" \n",
" c | \n",
" 1 | \n",
" 4.3 | \n",
" three | \n",
" True | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C D\n",
"a 6 0.3 one True\n",
"b 2 5.1 two False\n",
"c 1 4.3 three True"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame([[6,0.3,'one', True],[2,5.1,'two', False],[1,4.3,'three', True]], \n",
" index=['a','b','c'], \n",
" columns=['A','B','C','D'])\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "afraid-pizza",
"metadata": {},
"source": [
"To obtain the list of the dtypes of each single column of the DataFrame,\n",
"we can call the dtypes attribute, as shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "7a789dc3",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"A int64\n",
"B float64\n",
"C object\n",
"D bool\n",
"dtype: object"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.dtypes"
]
},
{
"cell_type": "markdown",
"id": "worth-thriller",
"metadata": {},
"source": [
"At this point, based on `dtypes`, we can select by passing, for example, a list\n",
"containing the dtypes we want in the include parameter, as shown as\n",
"follows:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "08474bd8",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 6 | \n",
" True | \n",
"
\n",
" \n",
" b | \n",
" 2 | \n",
" False | \n",
"
\n",
" \n",
" c | \n",
" 1 | \n",
" True | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A D\n",
"a 6 True\n",
"b 2 False\n",
"c 1 True"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2 = df.select_dtypes(include=['bool','int64'])\n",
"df2"
]
},
{
"cell_type": "markdown",
"id": "musical-irrigation",
"metadata": {},
"source": [
"If, on the other hand, we want to select by exclusion, we use the exclude\n",
"parameter instead with a list containing the `dtypes` we want to exclude, as\n",
"shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "6ce1045b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" B C\n",
"a 0.3 one\n",
"b 5.1 two\n",
"c 4.3 three\n"
]
}
],
"source": [
"df2 = df.select_dtypes(exclude=['bool','int64'])\n",
"print(df2)"
]
},
{
"cell_type": "markdown",
"id": "8fd36488",
"metadata": {},
"source": [
"## Filtering"
]
},
{
"cell_type": "markdown",
"id": "backed-internet",
"metadata": {},
"source": [
"We saw in the previous section how it is possible to select elements within a\n",
"DataFrame (or even a Series) by specifying appropriate spatial indicators\n",
"such as labels or numerical position indices. In this section, we will see\n",
"another very similar operation, which also selects, but is totally focused on\n",
"the values of the elements – **filtering**. In this case, we will set particular\n",
"conditions in the selection designed to select only the elements with the\n",
"values that satisfy them. There is, therefore, filtering of the values of a\n",
"DataFrame."
]
},
{
"cell_type": "markdown",
"id": "tribal-differential",
"metadata": {},
"source": [
"### The Boolean condition\n",
"The Filtering technique is, therefore, based on a particular condition, called\n",
"a Boolean condition, which is nothing more than a Boolean expression, that\n",
"is, a mathematical expression that results in True and False (Boolean\n",
"values). An example of a Boolean expression could be as follows:\n",
"\n",
"`x > 5 (values greater than 5)`"
]
},
{
"cell_type": "markdown",
"id": "electrical-management",
"metadata": {},
"source": [
"If this condition is true, it will return True, otherwise False. These Boolean\n",
"expressions are characterized by the use of Boolean operators such as (<,>,!\n",
"=, ==, <=,> =), which are intended to compare two values and set a\n",
"condition that can either be verified or not.\n",
"In the case of DataFrames, this Boolean condition is applied to the values\n",
"contained in a particular subset of the DataFrame, establishing which part\n",
"(column or row) will respond to this condition and returning a Series of\n",
"Boolean values (`True` or `False`). The final result is a filtering that returns us\n",
"a DataFrame that preserves the same structure as the original but that\n",
"includes only some rows or columns of the original DataFrame."
]
},
{
"cell_type": "markdown",
"id": "verbal-kingdom",
"metadata": {},
"source": [
"### Filtering on the lines\n",
"The most intuitive filtering is filtering on lines, that is, the one that will pass\n",
"only a few lines of a DataFrame based on a Boolean condition.\n",
"We first create a DataFrame of random values between 0 and 1, distributed\n",
"over 4 columns, as shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "bd38e350",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" A B C D\n",
"0 0.497878 0.753872 0.316998 0.897446\n",
"1 0.346698 0.172985 0.023653 0.589517\n",
"2 0.440887 0.719287 0.622692 0.280825\n",
"3 0.776862 0.214410 0.576098 0.220265\n",
"4 0.652671 0.033319 0.093250 0.069738\n",
"5 0.153605 0.003893 0.080816 0.195577\n",
"6 0.835020 0.539567 0.384164 0.991319\n",
"7 0.355455 0.911003 0.893413 0.923197\n",
"8 0.295068 0.667123 0.554785 0.846672\n",
"9 0.523384 0.465534 0.286456 0.204596\n"
]
}
],
"source": [
"df = pd.DataFrame(np.random.rand(40).reshape(10,4), columns = ['A','B','C','D'])\n",
"print(df)"
]
},
{
"cell_type": "markdown",
"id": "funded-bradley",
"metadata": {},
"source": [
"Now, we will define a Boolean condition, based on which the selection will\n",
"be made. For example, we want to filter only those rows of a DataFrame\n",
"that have the value of column A greater than 0.5. Since these are floatingpoint numbers, let’s add some significant digits, as shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "chronic-company",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 35,
"id": "b6186cda",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 False\n",
"1 False\n",
"2 False\n",
"3 True\n",
"4 True\n",
"5 False\n",
"6 True\n",
"7 False\n",
"8 False\n",
"9 True\n",
"Name: A, dtype: bool"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['A'] > 0.500000"
]
},
{
"cell_type": "markdown",
"id": "inner-rebel",
"metadata": {},
"source": [
"As an indexing operator, we use the one represented by two square brackets\n",
"`[ ]`. This operator will then take care of passing only the rows corresponding\n",
"to the True values, based on the indices of the Series generated by the\n",
"Boolean condition, as shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "de324a7c",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 6 | \n",
" 0.3 | \n",
" one | \n",
" True | \n",
"
\n",
" \n",
" b | \n",
" 2 | \n",
" 5.1 | \n",
" two | \n",
" False | \n",
"
\n",
" \n",
" c | \n",
" 1 | \n",
" 4.3 | \n",
" three | \n",
" True | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C D\n",
"a 6 0.3 one True\n",
"b 2 5.1 two False\n",
"c 1 4.3 three True"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2 = df[ df['A'] > 0.500000 ]\n",
"df2"
]
},
{
"cell_type": "markdown",
"id": "hired-intellectual",
"metadata": {},
"source": [
"But you can also use other indexing operators, such as indexer loc[], as\n",
"shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": 37,
"id": "189475cf",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" A B C D\n",
"3 0.776862 0.214410 0.576098 0.220265\n",
"4 0.652671 0.033319 0.093250 0.069738\n",
"6 0.835020 0.539567 0.384164 0.991319\n",
"9 0.523384 0.465534 0.286456 0.204596\n"
]
}
],
"source": [
"df2 = df.loc[ df['A'] > 0.500000]\n",
"df2"
]
},
{
"cell_type": "markdown",
"id": "australian-storm",
"metadata": {},
"source": [
"The result is pretty much the same."
]
},
{
"cell_type": "markdown",
"id": "advance-british",
"metadata": {},
"source": [
"### Filtering on columns\n"
]
},
{
"cell_type": "markdown",
"id": "union-prescription",
"metadata": {},
"source": [
"The complementary case to the previous one is filtering on columns. In this\n",
"case, only those columns of the DataFrame are filtered, whose values meet\n",
"certain requirements. Again, we will define a Boolean condition. For\n",
"example, we want to select only those columns whose average of values is\n",
"greater than 0.5, as shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "ddfb82ca",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"A True\n",
"B True\n",
"D True\n",
"dtype: bool"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.mean() > 0.500000"
]
},
{
"cell_type": "markdown",
"id": "prospective-complement",
"metadata": {},
"source": [
"### Application of multiple conditions"
]
},
{
"cell_type": "markdown",
"id": "roman-algeria",
"metadata": {},
"source": [
"\n",
"In Filtering, you can apply several Boolean conditions at the same time. To\n",
"do this, different conditions can be applied at the same time, thanks to the\n",
"logical operators `AND`, `OR`, and `NOT`. In the Pandas library, these logical\n",
"operators are expressed in the following way:"
]
},
{
"cell_type": "markdown",
"id": "recognized-liberia",
"metadata": {},
"source": [
"* & (and)\n",
"* | (or)\n",
"* ~ (not)"
]
},
{
"cell_type": "markdown",
"id": "collective-mustang",
"metadata": {},
"source": [
"So, every single condition is enclosed in square brackets and joined\n",
"together by the previous logical operators. We can express joint Boolean\n",
"conditions in the following way:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "96a04e65",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"a True\n",
"b True\n",
"c True\n",
"Name: A, dtype: bool"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(df['A'] < 0.300000) | (df['A'] > 0.700000)"
]
},
{
"cell_type": "markdown",
"id": "signal-potential",
"metadata": {},
"source": [
"Apply it inside an indexing operator or an indexer, as shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "e535e315",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 6 | \n",
" 0.3 | \n",
" one | \n",
" True | \n",
"
\n",
" \n",
" b | \n",
" 2 | \n",
" 5.1 | \n",
" two | \n",
" False | \n",
"
\n",
" \n",
" c | \n",
" 1 | \n",
" 4.3 | \n",
" three | \n",
" True | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C D\n",
"a 6 0.3 one True\n",
"b 2 5.1 two False\n",
"c 1 4.3 three True"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2 = df[(df['A'] < 0.300000) | (df['A'] > 0.700000)]\n",
"df2"
]
},
{
"cell_type": "markdown",
"id": "disturbed-launch",
"metadata": {},
"source": [
"### Boolean reductions"
]
},
{
"cell_type": "markdown",
"id": "static-guatemala",
"metadata": {},
"source": [
"\n",
"One topic related to the concept of filtering is Boolean reductions. There are\n",
"a number of methods, called reduction methods, which allow you to obtain\n",
"a reduced result, as shown as follows:\n",
"\n",
"* any()\n",
"* all()"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "automatic-squad",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0.358832 | \n",
" 0.231811 | \n",
" 0.824005 | \n",
" 0.000521 | \n",
"
\n",
" \n",
" 1 | \n",
" 0.407515 | \n",
" 0.319009 | \n",
" 0.505891 | \n",
" 0.117052 | \n",
"
\n",
" \n",
" 2 | \n",
" 0.202044 | \n",
" 0.530385 | \n",
" 0.671359 | \n",
" 0.275782 | \n",
"
\n",
" \n",
" 3 | \n",
" 0.120274 | \n",
" 0.025573 | \n",
" 0.925733 | \n",
" 0.122130 | \n",
"
\n",
" \n",
" 4 | \n",
" 0.949501 | \n",
" 0.075059 | \n",
" 0.518535 | \n",
" 0.503486 | \n",
"
\n",
" \n",
" 5 | \n",
" 0.608851 | \n",
" 0.921847 | \n",
" 0.773583 | \n",
" 0.887049 | \n",
"
\n",
" \n",
" 6 | \n",
" 0.484373 | \n",
" 0.133161 | \n",
" 0.602327 | \n",
" 0.743169 | \n",
"
\n",
" \n",
" 7 | \n",
" 0.264981 | \n",
" 0.067251 | \n",
" 0.803267 | \n",
" 0.589700 | \n",
"
\n",
" \n",
" 8 | \n",
" 0.049802 | \n",
" 0.324119 | \n",
" 0.483424 | \n",
" 0.694372 | \n",
"
\n",
" \n",
" 9 | \n",
" 0.453083 | \n",
" 0.183333 | \n",
" 0.555621 | \n",
" 0.176439 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C D\n",
"0 0.358832 0.231811 0.824005 0.000521\n",
"1 0.407515 0.319009 0.505891 0.117052\n",
"2 0.202044 0.530385 0.671359 0.275782\n",
"3 0.120274 0.025573 0.925733 0.122130\n",
"4 0.949501 0.075059 0.518535 0.503486\n",
"5 0.608851 0.921847 0.773583 0.887049\n",
"6 0.484373 0.133161 0.602327 0.743169\n",
"7 0.264981 0.067251 0.803267 0.589700\n",
"8 0.049802 0.324119 0.483424 0.694372\n",
"9 0.453083 0.183333 0.555621 0.176439"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame(np.random.rand(40).reshape(10,4), columns = ['A','B','C','D'])\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "internal-helen",
"metadata": {},
"source": [
"But, let’s see some examples to clarify their function. We apply a Boolean\n",
"condition in which all the elements of the DataFrame are evaluated. For\n",
"example, we want to know which elements of the DataFrame have a value\n",
"greater than 0.1, as shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "f79e6553",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" True | \n",
" True | \n",
" True | \n",
" False | \n",
"
\n",
" \n",
" 1 | \n",
" True | \n",
" True | \n",
" True | \n",
" True | \n",
"
\n",
" \n",
" 2 | \n",
" True | \n",
" True | \n",
" True | \n",
" True | \n",
"
\n",
" \n",
" 3 | \n",
" True | \n",
" False | \n",
" True | \n",
" True | \n",
"
\n",
" \n",
" 4 | \n",
" True | \n",
" False | \n",
" True | \n",
" True | \n",
"
\n",
" \n",
" 5 | \n",
" True | \n",
" True | \n",
" True | \n",
" True | \n",
"
\n",
" \n",
" 6 | \n",
" True | \n",
" True | \n",
" True | \n",
" True | \n",
"
\n",
" \n",
" 7 | \n",
" True | \n",
" False | \n",
" True | \n",
" True | \n",
"
\n",
" \n",
" 8 | \n",
" False | \n",
" True | \n",
" True | \n",
" True | \n",
"
\n",
" \n",
" 9 | \n",
" True | \n",
" True | \n",
" True | \n",
" True | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C D\n",
"0 True True True False\n",
"1 True True True True\n",
"2 True True True True\n",
"3 True False True True\n",
"4 True False True True\n",
"5 True True True True\n",
"6 True True True True\n",
"7 True False True True\n",
"8 False True True True\n",
"9 True True True True"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2 = df > 0.1\n",
"df2"
]
},
{
"cell_type": "markdown",
"id": "dedicated-diploma",
"metadata": {},
"source": [
"In this case, we got a DataFrame full of Boolean values. A reduction\n",
"method can be applied to this DataFrame, such as the `all()` method to\n",
"know if, within each column, all the values meet these requirements (all on\n",
"True) , as shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "e28120ef",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"A False\n",
"B False\n",
"C True\n",
"D False\n",
"dtype: bool"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(df > 0.1).all()"
]
},
{
"cell_type": "markdown",
"id": "apparent-isaac",
"metadata": {},
"source": [
">#### Exercise : \n",
"What do the following command do?\n",
"1. `(df > 0.1).any()`\n",
"\n",
"2. `(df > 0.1).any().any()`\n"
]
},
{
"cell_type": "markdown",
"id": "connected-shape",
"metadata": {},
"source": [
"### Filtering with isin()"
]
},
{
"cell_type": "markdown",
"id": "according-detection",
"metadata": {},
"source": [
"\n",
"Another method of Filtering is `isin( )`. This method lets you know if one or\n",
"more elements are present inside an object such as a DataFrame or a Series.\n",
"In this regard, we define an example DataFrame, with predefined values, as\n",
"shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": 35,
"id": "a59beb0f",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" 1 | \n",
" 3 | \n",
" 4 | \n",
" 5 | \n",
"
\n",
" \n",
" 2 | \n",
" 6 | \n",
" 7 | \n",
" 8 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C\n",
"0 0 1 2\n",
"1 3 4 5\n",
"2 6 7 8"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame(np.arange(9).reshape(3,3), columns = ['A','B','C'])\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "backed-silicon",
"metadata": {},
"source": [
"Now, let’s apply the `isin( )`method with some values to search for, as\n",
"follows:"
]
},
{
"cell_type": "code",
"execution_count": 36,
"id": "ff2ab9f0",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" True | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 1 | \n",
" True | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 2 | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C\n",
"0 True False False\n",
"1 True False False\n",
"2 False False False"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2 = df.isin([0,3])\n",
"df2"
]
},
{
"cell_type": "markdown",
"id": "veterinary-january",
"metadata": {},
"source": [
"Then, passing this Boolean DataFrame to the indexing operator, you get a\n",
"filtering of the values, in which only those which correspond to True as a\n",
"value will be passed, while all the others, which correspond to False, will be\n",
"evaluated as NaN, as shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": 37,
"id": "4fa7284d",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0.0 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1 | \n",
" 3.0 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 2 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C\n",
"0 0.0 NaN NaN\n",
"1 3.0 NaN NaN\n",
"2 NaN NaN NaN"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[df2]"
]
},
{
"cell_type": "markdown",
"id": "fiscal-indonesia",
"metadata": {},
"source": [
" For example, we can enhance them with other values, or\n",
"completely remove them from the DataFrame, leaving only the enhanced\n",
"elements, as shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": 38,
"id": "7c00583a",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 3.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A\n",
"0 0.0\n",
"1 3.0"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2 = df[df.isin([0,3])].dropna(thresh=1).dropna(axis=1)\n",
"df2"
]
},
{
"cell_type": "markdown",
"id": "weird-crystal",
"metadata": {},
"source": [
"## Adding, inserting, and deleting a column in a DataFrame\n"
]
},
{
"cell_type": "markdown",
"id": "vanilla-affairs",
"metadata": {},
"source": [
"You will have understood by now that due to the way the DataFrames are\n",
"structured within them, you can think of the columns as separate Series,\n",
"assembled together to form data tables. So the editing operations of a\n",
"DataFrame working on the columns are somewhat easier."
]
},
{
"cell_type": "markdown",
"id": "9af264df",
"metadata": {},
"source": [
"## Editing"
]
},
{
"cell_type": "code",
"execution_count": 52,
"id": "40b8aa42",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
"
\n",
" \n",
" b | \n",
" 4 | \n",
" 5 | \n",
" 6 | \n",
"
\n",
" \n",
" c | \n",
" 7 | \n",
" 8 | \n",
" 9 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C\n",
"a 1 2 3\n",
"b 4 5 6\n",
"c 7 8 9"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], \n",
" index=['a','b','c'], \n",
" columns=['A','B','C'])\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "hazardous-vaccine",
"metadata": {},
"source": [
"Adding a column at the end is a fairly simple operation. It is sufficient to\n",
"select a column that does not yet exist, specify a new label within the\n",
"indexing operator, and enhance it in some way. In this way, a new column\n",
"will automatically be created which will be added after the last existing\n",
"column.\n",
"For example, if we value a non-existent column with a scalar value, , we\n",
"will get a column completely valued with that number, as shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": 53,
"id": "e31547db",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" 17 | \n",
"
\n",
" \n",
" b | \n",
" 4 | \n",
" 5 | \n",
" 6 | \n",
" 17 | \n",
"
\n",
" \n",
" c | \n",
" 7 | \n",
" 8 | \n",
" 9 | \n",
" 17 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C D\n",
"a 1 2 3 17\n",
"b 4 5 6 17\n",
"c 7 8 9 17"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['D'] = 17 \n",
"df"
]
},
{
"cell_type": "markdown",
"id": "sacred-phoenix",
"metadata": {},
"source": [
"From now on, if you select this column, the behavior will be the normal\n",
"selection behavior, where the returned value will be a Series, as shown as\n",
"follows:"
]
},
{
"cell_type": "code",
"execution_count": 54,
"id": "e79fc990",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"a 17\n",
"b 17\n",
"c 17\n",
"Name: D, dtype: int64"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['D'] "
]
},
{
"cell_type": "markdown",
"id": "comprehensive-hybrid",
"metadata": {},
"source": [
"If we wanted to create a new column with all different values, instead of a\n",
"single scalar, we can pass a list containing all the values, as shown as\n",
"follows:"
]
},
{
"cell_type": "code",
"execution_count": 55,
"id": "ebe1903b",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
" E | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" 17 | \n",
" 13 | \n",
"
\n",
" \n",
" b | \n",
" 4 | \n",
" 5 | \n",
" 6 | \n",
" 17 | \n",
" 14 | \n",
"
\n",
" \n",
" c | \n",
" 7 | \n",
" 8 | \n",
" 9 | \n",
" 17 | \n",
" 15 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C D E\n",
"a 1 2 3 17 13\n",
"b 4 5 6 17 14\n",
"c 7 8 9 17 15"
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['E'] = [13,14,15]\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "nutritional-annotation",
"metadata": {},
"source": [
"The same technique can be used to update the values of an already existing\n",
"column. For example, passing a list of values to an already existing column\n",
"will replace those already present, as shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": 56,
"id": "bff04f5f",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
" E | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" one | \n",
" 13 | \n",
"
\n",
" \n",
" b | \n",
" 4 | \n",
" 5 | \n",
" 6 | \n",
" two | \n",
" 14 | \n",
"
\n",
" \n",
" c | \n",
" 7 | \n",
" 8 | \n",
" 9 | \n",
" three | \n",
" 15 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C D E\n",
"a 1 2 3 one 13\n",
"b 4 5 6 two 14\n",
"c 7 8 9 three 15"
]
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['D'] = ['one','two', 'three']\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "tough-finish",
"metadata": {},
"source": [
"To enhance a column, you can use an existing Series instead of a list. In this\n",
"case, however, you must pay close attention to the correspondence between\n",
"the labels of the Series and those of the DataFrame, as shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": 57,
"id": "d20806d9",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
" E | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" 10 | \n",
" 13 | \n",
"
\n",
" \n",
" b | \n",
" 4 | \n",
" 5 | \n",
" 6 | \n",
" 11 | \n",
" 14 | \n",
"
\n",
" \n",
" c | \n",
" 7 | \n",
" 8 | \n",
" 9 | \n",
" 12 | \n",
" 15 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C D E\n",
"a 1 2 3 10 13\n",
"b 4 5 6 11 14\n",
"c 7 8 9 12 15"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sr = pd.Series([10,11,12],index=['a','b','c'])\n",
"df['D'] = sr \n",
"df"
]
},
{
"cell_type": "markdown",
"id": "dramatic-outside",
"metadata": {},
"source": [
"Another very useful operation is the removal of a column from a\n",
"DataFrame. The del keyword exists in Python that can be used for this\n",
"purpose, as shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": 58,
"id": "e51d50c7",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" 10 | \n",
"
\n",
" \n",
" b | \n",
" 4 | \n",
" 5 | \n",
" 6 | \n",
" 11 | \n",
"
\n",
" \n",
" c | \n",
" 7 | \n",
" 8 | \n",
" 9 | \n",
" 12 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C D\n",
"a 1 2 3 10\n",
"b 4 5 6 11\n",
"c 7 8 9 12"
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"del df['E']\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "adult-paragraph",
"metadata": {},
"source": [
"In this case, the column is permanently removed from the DataFrame, but\n",
"there may be cases where the removed column wants to be preserved in\n",
"some way. The pop() method allows you to remove the column from the\n",
"DataFrame and return it as Series that can be stored in a new variable, as\n",
"shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": 59,
"id": "10780e37",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
"
\n",
" \n",
" b | \n",
" 4 | \n",
" 5 | \n",
" 6 | \n",
"
\n",
" \n",
" c | \n",
" 7 | \n",
" 8 | \n",
" 9 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C\n",
"a 1 2 3\n",
"b 4 5 6\n",
"c 7 8 9"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sr = df.pop('D') \n",
"print(type(sr)) \n",
"df"
]
},
{
"cell_type": "markdown",
"id": "limited-bishop",
"metadata": {},
"source": [
"Another useful method for editing a column within a DataFrame is the\n",
"insert() method. This method inserts a new column within a DataFrame\n",
"in a predetermined position.\n",
"For example, we can insert the Series we extracted earlier with the `pop()`\n",
"method, back into the DataFrame in position 1, calling it as label `'Z'`. All\n",
"this information is passed as arguments to the `insert()` function, as shown\n",
"as follows:"
]
},
{
"cell_type": "code",
"execution_count": 60,
"id": "38a02c4e",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" Z | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 1 | \n",
" 10 | \n",
" 2 | \n",
" 3 | \n",
"
\n",
" \n",
" b | \n",
" 4 | \n",
" 11 | \n",
" 5 | \n",
" 6 | \n",
"
\n",
" \n",
" c | \n",
" 7 | \n",
" 12 | \n",
" 8 | \n",
" 9 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A Z B C\n",
"a 1 10 2 3\n",
"b 4 11 5 6\n",
"c 7 12 8 9"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.insert(1, 'Z', sr) \n",
"df"
]
},
{
"cell_type": "markdown",
"id": "exterior-wilson",
"metadata": {},
"source": [
"The same `insert()` method can also be used to copy a column within a\n",
"DataFrame. For example, in the previous case, the column to be copied is\n",
"passed instead of an external Series, as shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": 61,
"id": "2076e6cd",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" Z | \n",
" B | \n",
" B2 | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 1 | \n",
" 10 | \n",
" 2 | \n",
" 2 | \n",
" 3 | \n",
"
\n",
" \n",
" b | \n",
" 4 | \n",
" 11 | \n",
" 5 | \n",
" 5 | \n",
" 6 | \n",
"
\n",
" \n",
" c | \n",
" 7 | \n",
" 12 | \n",
" 8 | \n",
" 8 | \n",
" 9 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A Z B B2 C\n",
"a 1 10 2 2 3\n",
"b 4 11 5 5 6\n",
"c 7 12 8 8 9"
]
},
"execution_count": 61,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.insert(3, 'B2', df['B'])\n",
"df\n"
]
},
{
"cell_type": "markdown",
"id": "ecological-arthritis",
"metadata": {},
"source": [
"## Adding, inserting, and deleting a row in a DataFrame\n"
]
},
{
"cell_type": "markdown",
"id": "duplicate-potato",
"metadata": {},
"source": [
"As for the editing of the rows, it is appropriate to make some small\n",
"considerations to understand how to make these operations very similar to\n",
"the case of columns.\n",
"First, let’s start again with a test DataFrame, as follows:"
]
},
{
"cell_type": "code",
"execution_count": 62,
"id": "ae896dbc",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
"
\n",
" \n",
" b | \n",
" 4 | \n",
" 5 | \n",
" 6 | \n",
"
\n",
" \n",
" c | \n",
" 7 | \n",
" 8 | \n",
" 9 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C\n",
"a 1 2 3\n",
"b 4 5 6\n",
"c 7 8 9"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], \n",
" index=['a','b','c'], \n",
" columns=['A','B','C'])\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "deluxe-skirt",
"metadata": {},
"source": [
"We saw that in using the columns, we are privileged because we largely\n",
"exploit the internal structure of the DataFrame, which provides the columns\n",
"as a Series. So, just call a column with the corresponding label to get a\n",
"Series, as shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": 48,
"id": "57868738",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"a 1\n",
"b 4\n",
"c 7\n",
"Name: A, dtype: int64"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['A']"
]
},
{
"cell_type": "markdown",
"id": "heard-charlotte",
"metadata": {},
"source": [
"This reality has been pushed into the syntax within the library to the point\n",
"of considering the columns as real attributes, without even using the\n",
"indexing operator anymore, as shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": 63,
"id": "7e3dbd5b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"a 1\n",
"b 4\n",
"c 7\n",
"Name: A, dtype: int64"
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.A"
]
},
{
"cell_type": "markdown",
"id": "waiting-illinois",
"metadata": {},
"source": [
"You can then add a row in the same way as for columns. For example, by\n",
"assigning a scalar value to a selection of a non-existing row, a new row is\n",
"added at the bottom of the DataFrame, whose values are the repetition of\n",
"that scalar value, and the label is the one passed in the selection operator, as\n",
"shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": 65,
"id": "c20cab26",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
"
\n",
" \n",
" b | \n",
" 4 | \n",
" 5 | \n",
" 6 | \n",
"
\n",
" \n",
" c | \n",
" 7 | \n",
" 8 | \n",
" 9 | \n",
"
\n",
" \n",
" d | \n",
" 12 | \n",
" 12 | \n",
" 12 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C\n",
"a 1 2 3\n",
"b 4 5 6\n",
"c 7 8 9\n",
"d 12 12 12"
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc['d'] = 12 \n",
"df"
]
},
{
"cell_type": "markdown",
"id": "numerical-question",
"metadata": {},
"source": [
"Or, you can change it by using a list of values, as shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": 66,
"id": "better-machine",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
"
\n",
" \n",
" b | \n",
" 4 | \n",
" 5 | \n",
" 6 | \n",
"
\n",
" \n",
" c | \n",
" 7 | \n",
" 8 | \n",
" 9 | \n",
"
\n",
" \n",
" d | \n",
" 10 | \n",
" 11 | \n",
" 12 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C\n",
"a 1 2 3\n",
"b 4 5 6\n",
"c 7 8 9\n",
"d 10 11 12"
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc['d'] = [10,11,12]\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "subtle-entrance",
"metadata": {},
"source": [
"Or, instead of a list, you can change it by using an already existing Series,\n",
"as shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": 67,
"id": "43af3795",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
"
\n",
" \n",
" b | \n",
" 4 | \n",
" 5 | \n",
" 6 | \n",
"
\n",
" \n",
" c | \n",
" 7 | \n",
" 8 | \n",
" 9 | \n",
"
\n",
" \n",
" d | \n",
" 13 | \n",
" 14 | \n",
" 15 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C\n",
"a 1 2 3\n",
"b 4 5 6\n",
"c 7 8 9\n",
"d 13 14 15"
]
},
"execution_count": 67,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sr = pd.Series([13,14,15], index=['A','B','C'], name=\"e\") #aggiungi riga con series\n",
"df.loc['d'] = sr\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "enormous-factor",
"metadata": {},
"source": [
"Then, to delete a row, use the following"
]
},
{
"cell_type": "code",
"execution_count": 68,
"id": "ced75096",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" b | \n",
" 4 | \n",
" 5 | \n",
" 6 | \n",
"
\n",
" \n",
" c | \n",
" 7 | \n",
" 8 | \n",
" 9 | \n",
"
\n",
" \n",
" d | \n",
" 13 | \n",
" 14 | \n",
" 15 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C\n",
"b 4 5 6\n",
"c 7 8 9\n",
"d 13 14 15"
]
},
"execution_count": 68,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2 = df.drop(['a'],axis=0) \n",
"df2"
]
},
{
"cell_type": "markdown",
"id": "rotary-relevance",
"metadata": {},
"source": [
"Whereas, use the following to delete a column:"
]
},
{
"cell_type": "code",
"execution_count": 69,
"id": "23078a82",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 2 | \n",
" 3 | \n",
"
\n",
" \n",
" b | \n",
" 5 | \n",
" 6 | \n",
"
\n",
" \n",
" c | \n",
" 8 | \n",
" 9 | \n",
"
\n",
" \n",
" d | \n",
" 14 | \n",
" 15 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" B C\n",
"a 2 3\n",
"b 5 6\n",
"c 8 9\n",
"d 14 15"
]
},
"execution_count": 69,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df3 = df.drop(['A'],axis=1) \n",
"df3"
]
},
{
"cell_type": "markdown",
"id": "ahead-duration",
"metadata": {},
"source": [
"If we want to delete multiple rows or columns at the same time, just add the\n",
"corresponding labels in the passed list as an argument, as shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": 70,
"id": "257a13f8",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" b | \n",
" 4 | \n",
" 5 | \n",
" 6 | \n",
"
\n",
" \n",
" d | \n",
" 13 | \n",
" 14 | \n",
" 15 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C\n",
"b 4 5 6\n",
"d 13 14 15"
]
},
"execution_count": 70,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2 = df.drop(['a','c'],axis=0) \n",
"df2"
]
},
{
"cell_type": "markdown",
"id": "posted-amateur",
"metadata": {},
"source": [
"### Adding new columns with assign()\n"
]
},
{
"cell_type": "markdown",
"id": "streaming-jimmy",
"metadata": {},
"source": [
"Among the methods applicable to DataFrames that have editing functions,\n",
"there is the assign() method that allows you to create new columns to be\n",
"added easily, deriving them from those already present. For example, we\n",
"can add a new column to the DataFrame starting from the values of an\n",
"already existing column by multiplying it by 2, as shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": 71,
"id": "819e0936",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 3 | \n",
" 7 | \n",
" 5 | \n",
"
\n",
" \n",
" b | \n",
" 8 | \n",
" 3 | \n",
" 9 | \n",
"
\n",
" \n",
" c | \n",
" 1 | \n",
" 3 | \n",
" 6 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C\n",
"a 3 7 5\n",
"b 8 3 9\n",
"c 1 3 6"
]
},
"execution_count": 71,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame(np.random.randint(1,10,9).reshape(3,3), \n",
" index=['a','b','c'], \n",
" columns=['A','B','C'])\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "composed-grenada",
"metadata": {},
"source": [
"The `assign ()` method does not modify the original DataFrame but creates\n",
"a new copy of it as a return value. So, the result can be assigned to a new\n",
"variable, as we have just done, or reassigned to the original DataFrame,\n",
"modifying it. The choice of what to do is therefore up to the user, as shown\n",
"as follows:"
]
},
{
"cell_type": "code",
"execution_count": 76,
"id": "34f35901",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 3 | \n",
" 7 | \n",
" 5 | \n",
" 6 | \n",
"
\n",
" \n",
" b | \n",
" 8 | \n",
" 3 | \n",
" 9 | \n",
" 16 | \n",
"
\n",
" \n",
" c | \n",
" 1 | \n",
" 3 | \n",
" 6 | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C D\n",
"a 3 7 5 6\n",
"b 8 3 9 16\n",
"c 1 3 6 2"
]
},
"execution_count": 76,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2 = df.assign(D = df['A']*2)\n",
"df2"
]
},
{
"cell_type": "code",
"execution_count": 77,
"id": "c19de8f8",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 3 | \n",
" 7 | \n",
" 5 | \n",
" 6 | \n",
"
\n",
" \n",
" b | \n",
" 8 | \n",
" 3 | \n",
" 9 | \n",
" 16 | \n",
"
\n",
" \n",
" c | \n",
" 1 | \n",
" 3 | \n",
" 6 | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C D\n",
"a 3 7 5 6\n",
"b 8 3 9 16\n",
"c 1 3 6 2"
]
},
"execution_count": 77,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = df.assign(D = df['A']*2)\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "optional-dryer",
"metadata": {},
"source": [
"But the `assign()` method can also be used to overwrite an existing column,\n",
"which is often very useful for correcting or modifying the data within the\n",
"DataFrame, as shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": 78,
"id": "942dde90",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 3 | \n",
" 7 | \n",
" 5 | \n",
" 1.732051 | \n",
"
\n",
" \n",
" b | \n",
" 8 | \n",
" 3 | \n",
" 9 | \n",
" 2.828427 | \n",
"
\n",
" \n",
" c | \n",
" 1 | \n",
" 3 | \n",
" 6 | \n",
" 1.000000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C D\n",
"a 3 7 5 1.732051\n",
"b 8 3 9 2.828427\n",
"c 1 3 6 1.000000"
]
},
"execution_count": 78,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = df.assign(D = np.sqrt(df['A']))\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "1002ac4a",
"metadata": {},
"source": [
"## Descriptive Statistics"
]
},
{
"cell_type": "markdown",
"id": "likely-parameter",
"metadata": {},
"source": [
"Pandas is a library designed for data analysis, and the DataFrames are the\n",
"basic objects on which all the related activities are carried out. So having an\n",
"idea of their contents and their structure at the outset is important in order to\n",
"better target the operations to be carried out. Among the basic activities,\n",
"there is, therefore, the calculation of general statistics that can somehow\n",
"give an idea of the values contained within the DataFrame under analysis.\n",
"In this section, we will see many basic features that the Pandas library\n",
"offers to obtain this type of descriptive data."
]
},
{
"cell_type": "markdown",
"id": "radical-digest",
"metadata": {},
"source": [
"### The describe() method\n"
]
},
{
"cell_type": "markdown",
"id": "lasting-crossing",
"metadata": {},
"source": [
"Inside the Pandas library, there is a `describe()` method that calculates a\n",
"whole series of descriptive statistics of the reference DataFrame, as shown\n",
"as follows:"
]
},
{
"cell_type": "code",
"execution_count": 80,
"id": "e442da90",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0.908880 | \n",
" 0.189278 | \n",
" 0.004759 | \n",
" 0.040333 | \n",
"
\n",
" \n",
" 1 | \n",
" 0.903440 | \n",
" 0.909411 | \n",
" 0.983257 | \n",
" 0.130883 | \n",
"
\n",
" \n",
" 2 | \n",
" 0.790432 | \n",
" 0.451688 | \n",
" 0.556863 | \n",
" 0.096279 | \n",
"
\n",
" \n",
" 3 | \n",
" 0.506262 | \n",
" 0.235123 | \n",
" 0.165129 | \n",
" 0.133306 | \n",
"
\n",
" \n",
" 4 | \n",
" 0.692320 | \n",
" 0.661209 | \n",
" 0.990101 | \n",
" 0.733084 | \n",
"
\n",
" \n",
" 5 | \n",
" 0.420472 | \n",
" 0.988203 | \n",
" 0.198181 | \n",
" 0.013203 | \n",
"
\n",
" \n",
" 6 | \n",
" 0.132947 | \n",
" 0.140471 | \n",
" 0.549615 | \n",
" 0.212021 | \n",
"
\n",
" \n",
" 7 | \n",
" 0.148670 | \n",
" 0.659784 | \n",
" 0.630007 | \n",
" 0.206788 | \n",
"
\n",
" \n",
" 8 | \n",
" 0.372637 | \n",
" 0.065185 | \n",
" 0.199850 | \n",
" 0.351776 | \n",
"
\n",
" \n",
" 9 | \n",
" 0.272166 | \n",
" 0.677243 | \n",
" 0.164021 | \n",
" 0.233820 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1 2 3\n",
"0 0.908880 0.189278 0.004759 0.040333\n",
"1 0.903440 0.909411 0.983257 0.130883\n",
"2 0.790432 0.451688 0.556863 0.096279\n",
"3 0.506262 0.235123 0.165129 0.133306\n",
"4 0.692320 0.661209 0.990101 0.733084\n",
"5 0.420472 0.988203 0.198181 0.013203\n",
"6 0.132947 0.140471 0.549615 0.212021\n",
"7 0.148670 0.659784 0.630007 0.206788\n",
"8 0.372637 0.065185 0.199850 0.351776\n",
"9 0.272166 0.677243 0.164021 0.233820"
]
},
"execution_count": 80,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame(np.random.rand(40).reshape(10,4))\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "frequent-destruction",
"metadata": {},
"source": [
"It is launched directly on the DataFrame to be analyzed, and the result is a\n",
"table of descriptive statistics carried out at the column level and reported in\n",
"the structure of a DataFrame, as shown below"
]
},
{
"cell_type": "code",
"execution_count": 81,
"id": "ab8fb2ae",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
"
\n",
" \n",
" \n",
" \n",
" count | \n",
" 10.000000 | \n",
" 10.000000 | \n",
" 10.000000 | \n",
" 10.000000 | \n",
"
\n",
" \n",
" mean | \n",
" 0.514823 | \n",
" 0.497759 | \n",
" 0.444178 | \n",
" 0.215149 | \n",
"
\n",
" \n",
" std | \n",
" 0.294720 | \n",
" 0.329345 | \n",
" 0.352238 | \n",
" 0.207264 | \n",
"
\n",
" \n",
" min | \n",
" 0.132947 | \n",
" 0.065185 | \n",
" 0.004759 | \n",
" 0.013203 | \n",
"
\n",
" \n",
" 25% | \n",
" 0.297284 | \n",
" 0.200739 | \n",
" 0.173392 | \n",
" 0.104930 | \n",
"
\n",
" \n",
" 50% | \n",
" 0.463367 | \n",
" 0.555736 | \n",
" 0.374732 | \n",
" 0.170047 | \n",
"
\n",
" \n",
" 75% | \n",
" 0.765904 | \n",
" 0.673235 | \n",
" 0.611721 | \n",
" 0.228370 | \n",
"
\n",
" \n",
" max | \n",
" 0.908880 | \n",
" 0.988203 | \n",
" 0.990101 | \n",
" 0.733084 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1 2 3\n",
"count 10.000000 10.000000 10.000000 10.000000\n",
"mean 0.514823 0.497759 0.444178 0.215149\n",
"std 0.294720 0.329345 0.352238 0.207264\n",
"min 0.132947 0.065185 0.004759 0.013203\n",
"25% 0.297284 0.200739 0.173392 0.104930\n",
"50% 0.463367 0.555736 0.374732 0.170047\n",
"75% 0.765904 0.673235 0.611721 0.228370\n",
"max 0.908880 0.988203 0.990101 0.733084"
]
},
"execution_count": 81,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.describe() "
]
},
{
"cell_type": "markdown",
"id": "cathedral-detector",
"metadata": {},
"source": [
"The statistics processed are also returned in the form of DataFrame, each of\n",
"which is processed by a single column. Among the reported values, we see\n",
"the element count, the mean and its standard deviation, the maximum and\n",
"minimum value, and 3 percentiles. This information will be reported only\n",
"for the columns of the DataFrame that contain numeric values, as shown below"
]
},
{
"cell_type": "code",
"execution_count": 82,
"id": "2040f905",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" numb | \n",
"
\n",
" \n",
" \n",
" \n",
" count | \n",
" 5.000000 | \n",
"
\n",
" \n",
" mean | \n",
" 3.000000 | \n",
"
\n",
" \n",
" std | \n",
" 1.581139 | \n",
"
\n",
" \n",
" min | \n",
" 1.000000 | \n",
"
\n",
" \n",
" 25% | \n",
" 2.000000 | \n",
"
\n",
" \n",
" 50% | \n",
" 3.000000 | \n",
"
\n",
" \n",
" 75% | \n",
" 4.000000 | \n",
"
\n",
" \n",
" max | \n",
" 5.000000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" numb\n",
"count 5.000000\n",
"mean 3.000000\n",
"std 1.581139\n",
"min 1.000000\n",
"25% 2.000000\n",
"50% 3.000000\n",
"75% 4.000000\n",
"max 5.000000"
]
},
"execution_count": 82,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2 = pd.DataFrame([[1,'one', True],[2,'two',False],\n",
" [3,'three',True],[4,'four',False],[5,'five', False]], \n",
" columns=['numb','word','bool'])\n",
"df2.describe()"
]
},
{
"cell_type": "code",
"execution_count": 87,
"id": "28a934ec",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" word | \n",
" bool | \n",
"
\n",
" \n",
" \n",
" \n",
" count | \n",
" 5 | \n",
" 5 | \n",
"
\n",
" \n",
" unique | \n",
" 5 | \n",
" 2 | \n",
"
\n",
" \n",
" top | \n",
" one | \n",
" False | \n",
"
\n",
" \n",
" freq | \n",
" 1 | \n",
" 3 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" word bool\n",
"count 5 5\n",
"unique 5 2\n",
"top one False\n",
"freq 1 3"
]
},
"execution_count": 87,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2[['word','bool']].describe()"
]
},
{
"cell_type": "markdown",
"id": "fundamental-fourth",
"metadata": {},
"source": [
"So in a DataFrame with many columns, it would be a good practice to be\n",
"able to select only a few columns (or exclude the ones that don’t interest us)\n",
"without having to first make a selection of subsettings and then applying\n",
"`describe()` as done before. In this regard, there are two parameters,\n",
"include and exclude, which allow us to explicitly specify which columns\n",
"should be included or excluded in the calculation of descriptive statistics. The peculiarity is that you do not pass lists containing labels, but `dtypes`."
]
},
{
"cell_type": "code",
"execution_count": 83,
"id": "9b812b00",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" numb | \n",
"
\n",
" \n",
" \n",
" \n",
" count | \n",
" 5.000000 | \n",
"
\n",
" \n",
" mean | \n",
" 3.000000 | \n",
"
\n",
" \n",
" std | \n",
" 1.581139 | \n",
"
\n",
" \n",
" min | \n",
" 1.000000 | \n",
"
\n",
" \n",
" 25% | \n",
" 2.000000 | \n",
"
\n",
" \n",
" 50% | \n",
" 3.000000 | \n",
"
\n",
" \n",
" 75% | \n",
" 4.000000 | \n",
"
\n",
" \n",
" max | \n",
" 5.000000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" numb\n",
"count 5.000000\n",
"mean 3.000000\n",
"std 1.581139\n",
"min 1.000000\n",
"25% 2.000000\n",
"50% 3.000000\n",
"75% 4.000000\n",
"max 5.000000"
]
},
"execution_count": 83,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2.describe(include=['int'])"
]
},
{
"cell_type": "markdown",
"id": "ancient-longitude",
"metadata": {},
"source": [
"There are several methods to calculate descriptive statistics and other\n",
"related operations on Series and DataFrames. Those mainly taken into\n",
"consideration are a whole series of methods taken from the NumPy library\n",
"that are also applicable on the DataFrame. Many of these methods produce\n",
"aggregations, that is, they produce a smaller result by agglomerating a\n",
"group of values (rows or columns) under a single result. Aggregating\n",
"functions are, for example, `sum()`, `mean()`, `quantile()` which, if applied to\n",
"a DataFrame, aggregate its values returning a Series. Other methods of this\n",
"type produce results of the same size, and are, for example, functions like\n",
"`cumsum()`, and `cumprod()`. Generally, these methods take an axis parameter\n",
"passed as an argument to determine whether the calculation should be\n",
"evaluated by column `(axis = 0)` or by row `(axis = 1)`.\n",
"The statistical evaluation by column is the default one and you can also\n",
"omit to insert the axis parameter. For example, if you want to calculate the\n",
"mean of values in each column, you can easily use the mean() method, as\n",
"shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": 84,
"id": "a8bce1e7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 0.514823\n",
"1 0.497759\n",
"2 0.444178\n",
"3 0.215149\n",
"dtype: float64"
]
},
"execution_count": 84,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.mean(0) #df.mean(axis=0) #df.mean()"
]
},
{
"cell_type": "markdown",
"id": "toxic-assist",
"metadata": {},
"source": [
"In fact, as many results are obtained as there are columns in the DataFrame.\n",
"Whereas, if we want to carry out the statistical evaluation by row, the value\n",
"1 must be specified (both with and without axis), as shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": 90,
"id": "8df99c58",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 0.820657\n",
"1 0.454654\n",
"2 0.587895\n",
"3 0.553482\n",
"4 0.513660\n",
"5 0.310291\n",
"6 0.712128\n",
"7 0.493875\n",
"8 0.357265\n",
"9 0.711689\n",
"dtype: float64"
]
},
"execution_count": 90,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.mean(1) #df.mean(axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 91,
"id": "57468e02",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 0.319284\n",
"1 0.326032\n",
"2 0.289702\n",
"3 0.302796\n",
"dtype: float64"
]
},
"execution_count": 91,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.std()"
]
},
{
"cell_type": "markdown",
"id": "meaning-beginning",
"metadata": {},
"source": [
"## The standardization of data\n"
]
},
{
"cell_type": "markdown",
"id": "advised-casino",
"metadata": {},
"source": [
"To conclude this section, let’s add data standardization. This technique is\n",
"widely used in statistics and data analysis to make the different data series\n",
"better comparable to each other, making them “standard”. These data\n",
"groups are somehow “aligned” statistically in order to be able to compare\n",
"them more easily and highlight the differences. In a practical way,\n",
"standardization consists of making the mean of this group of values equal to\n",
"0 and the standard deviation unitary."
]
},
{
"cell_type": "code",
"execution_count": 92,
"id": "a5812aba",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 -1.110223e-17\n",
"1 1.249001e-16\n",
"2 6.661338e-17\n",
"3 -2.236709e-16\n",
"dtype: float64\n",
"0 1.0\n",
"1 1.0\n",
"2 1.0\n",
"3 1.0\n",
"dtype: float64\n"
]
}
],
"source": [
"std = ( df - df.mean())/ df.std()\n",
"print(std.mean())\n",
"print(std.std())"
]
},
{
"cell_type": "markdown",
"id": "c30f949c",
"metadata": {},
"source": [
"## Transposition, Sorting and Reindexing"
]
},
{
"cell_type": "markdown",
"id": "fewer-montreal",
"metadata": {},
"source": [
"In this final section, we will see some of the techniques that somehow\n",
"rearrange the internal structure of the DataFrame by acting at the index\n",
"level. In fact, the order of the labels within a DataFrame is not immutable\n",
"but can be changed during the analysis. There are, therefore, some\n",
"techniques that act at this level including transposition, sorting, and\n",
"reindexing."
]
},
{
"cell_type": "markdown",
"id": "fundamental-pittsburgh",
"metadata": {},
"source": [
"## Transposition"
]
},
{
"cell_type": "markdown",
"id": "liberal-press",
"metadata": {},
"source": [
"The transposition of a DataFrame is equivalent to the mathematical\n",
"operation of the transposition of a matrix. In this operation, the rows\n",
"become the columns and vice versa. Applying a transposition to a\n",
"DataFrame will, therefore, have the exchange of labels between index and\n",
"columns, with the consequent reordering of the values related to them, as\n",
"shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": 85,
"id": "32bc9672",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
"
\n",
" \n",
" b | \n",
" 4 | \n",
" 5 | \n",
" 6 | \n",
"
\n",
" \n",
" c | \n",
" 7 | \n",
" 8 | \n",
" 9 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C\n",
"a 1 2 3\n",
"b 4 5 6\n",
"c 7 8 9"
]
},
"execution_count": 85,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], \n",
" index=['a','b','c'], \n",
" columns=['A','B','C'])\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "intelligent-flight",
"metadata": {},
"source": [
"To carry out the transposition of a DataFrame, it is sufficient to recall the\n",
"attribute T, as shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": 86,
"id": "fb4ca041",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" a | \n",
" b | \n",
" c | \n",
"
\n",
" \n",
" \n",
" \n",
" A | \n",
" 1 | \n",
" 4 | \n",
" 7 | \n",
"
\n",
" \n",
" B | \n",
" 2 | \n",
" 5 | \n",
" 8 | \n",
"
\n",
" \n",
" C | \n",
" 3 | \n",
" 6 | \n",
" 9 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" a b c\n",
"A 1 4 7\n",
"B 2 5 8\n",
"C 3 6 9"
]
},
"execution_count": 86,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.T"
]
},
{
"cell_type": "markdown",
"id": "excited-eight",
"metadata": {},
"source": [
"We, thus, obtain the transposed DataFrame."
]
},
{
"cell_type": "markdown",
"id": "according-person",
"metadata": {},
"source": [
"## Sorting\n",
"Often, when loading a DataFrame from a data source, it happens that the\n",
"order of the rows or columns is not the desired one. Or during the various\u0000\n",
"analysis phases, we may wish to sort the values of a specific column in the\n",
"ascending or descending order. All these sorting operations can be\n",
"summarized in the following three types:\n",
"* Sorting by labels\n",
"* Sorting by values\n",
"* Sorting by a combination of both\n",
"\n",
"All of these sorting types are feasible with the Panda library. For this\n",
"purpose, we will create a DataFrame with the inside of the index and\n",
"columns labels arranged in an unordered way (this is practically the case for\n",
"almost all real DataFrames), as shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": 87,
"id": "551875d3",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" C | \n",
" D | \n",
" A | \n",
" B | \n",
"
\n",
" \n",
" \n",
" \n",
" b | \n",
" 4 | \n",
" 7 | \n",
" 5 | \n",
" 8 | \n",
"
\n",
" \n",
" d | \n",
" 2 | \n",
" 4 | \n",
" 2 | \n",
" 6 | \n",
"
\n",
" \n",
" c | \n",
" 9 | \n",
" 7 | \n",
" 6 | \n",
" 9 | \n",
"
\n",
" \n",
" a | \n",
" 9 | \n",
" 2 | \n",
" 4 | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" C D A B\n",
"b 4 7 5 8\n",
"d 2 4 2 6\n",
"c 9 7 6 9\n",
"a 9 2 4 2"
]
},
"execution_count": 87,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame(np.random.randint(10, size=16).reshape(4,4), \n",
" index=['b','d','c','a'], \n",
" columns=['C','D','A','B'])\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "proud-elite",
"metadata": {},
"source": [
"### Sorting by label"
]
},
{
"cell_type": "markdown",
"id": "advance-fruit",
"metadata": {},
"source": [
"\n",
"Sorting by labels, or sorting by index, is based on sorting the labels in\n",
"axis or columns based on their value (numeric or string), and it is performed\n",
"efficiently by the `sort_index()` method. By default, this method puts the\n",
"labels on the index in the DataFrame in the ascending order. Matching the\n",
"values on each row will also follow the same order, as shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": 88,
"id": "bd382e41",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" C | \n",
" D | \n",
" A | \n",
" B | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 9 | \n",
" 2 | \n",
" 4 | \n",
" 2 | \n",
"
\n",
" \n",
" b | \n",
" 4 | \n",
" 7 | \n",
" 5 | \n",
" 8 | \n",
"
\n",
" \n",
" c | \n",
" 9 | \n",
" 7 | \n",
" 6 | \n",
" 9 | \n",
"
\n",
" \n",
" d | \n",
" 2 | \n",
" 4 | \n",
" 2 | \n",
" 6 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" C D A B\n",
"a 9 2 4 2\n",
"b 4 7 5 8\n",
"c 9 7 6 9\n",
"d 2 4 2 6"
]
},
"execution_count": 88,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2 = df.sort_index()\n",
"df2"
]
},
{
"cell_type": "markdown",
"id": "starting-tomato",
"metadata": {},
"source": [
"As we can see from the result, the rows of the DataFrame have been\n",
"reordered following the ascending order of the index labels. We, therefore,\n",
"had a sort order by row, while the order of the columns remained\n",
"unchanged. If, on the other hand, we need to sort the labels in descending\n",
"order, simply add the ascending parameter and set it to False (by default,\n",
"`ascending = True`), as shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": 89,
"id": "e8902394",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" C | \n",
" D | \n",
" A | \n",
" B | \n",
"
\n",
" \n",
" \n",
" \n",
" d | \n",
" 2 | \n",
" 4 | \n",
" 2 | \n",
" 6 | \n",
"
\n",
" \n",
" c | \n",
" 9 | \n",
" 7 | \n",
" 6 | \n",
" 9 | \n",
"
\n",
" \n",
" b | \n",
" 4 | \n",
" 7 | \n",
" 5 | \n",
" 8 | \n",
"
\n",
" \n",
" a | \n",
" 9 | \n",
" 2 | \n",
" 4 | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" C D A B\n",
"d 2 4 2 6\n",
"c 9 7 6 9\n",
"b 4 7 5 8\n",
"a 9 2 4 2"
]
},
"execution_count": 89,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2 = df.sort_index(ascending=False)\n",
"df2"
]
},
{
"cell_type": "markdown",
"id": "lucky-stroke",
"metadata": {},
"source": [
"If we want to sort the labels in columns in ascending order, it will be\n",
"necessary to explicitly define the parameter `axis = 1` (by default `axis = 0`), as shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": 90,
"id": "b1d7218f",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" b | \n",
" 5 | \n",
" 8 | \n",
" 4 | \n",
" 7 | \n",
"
\n",
" \n",
" d | \n",
" 2 | \n",
" 6 | \n",
" 2 | \n",
" 4 | \n",
"
\n",
" \n",
" c | \n",
" 6 | \n",
" 9 | \n",
" 9 | \n",
" 7 | \n",
"
\n",
" \n",
" a | \n",
" 4 | \n",
" 2 | \n",
" 9 | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C D\n",
"b 5 8 4 7\n",
"d 2 6 2 4\n",
"c 6 9 9 7\n",
"a 4 2 9 2"
]
},
"execution_count": 90,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2 = df.sort_index(axis=1)\n",
"df2"
]
},
{
"cell_type": "markdown",
"id": "excessive-charleston",
"metadata": {},
"source": [
"As you can see from the result, this time, it will be the column labels that\n",
"will be sorted in the ascending order. The same thing will be done for the\n",
"corresponding values on the columns. Also in this case, if we want to sort in\n",
"the descending order, we explicitly set `ascending = False`, as shown as\n",
"follows:"
]
},
{
"cell_type": "code",
"execution_count": 91,
"id": "68841931",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" D | \n",
" C | \n",
" B | \n",
" A | \n",
"
\n",
" \n",
" \n",
" \n",
" b | \n",
" 7 | \n",
" 4 | \n",
" 8 | \n",
" 5 | \n",
"
\n",
" \n",
" d | \n",
" 4 | \n",
" 2 | \n",
" 6 | \n",
" 2 | \n",
"
\n",
" \n",
" c | \n",
" 7 | \n",
" 9 | \n",
" 9 | \n",
" 6 | \n",
"
\n",
" \n",
" a | \n",
" 2 | \n",
" 9 | \n",
" 2 | \n",
" 4 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" D C B A\n",
"b 7 4 8 5\n",
"d 4 2 6 2\n",
"c 7 9 9 6\n",
"a 2 9 2 4"
]
},
"execution_count": 91,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2 = df.sort_index(axis=1, ascending=False)\n",
"df2"
]
},
{
"cell_type": "markdown",
"id": "chicken-slovakia",
"metadata": {},
"source": [
"If we want to sort by both row labels and column labels, we can do it safely\n",
"by chaining the methods. We sort the DataFrame first by row, and then later\n",
"by column, or vice versa, as shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": 92,
"id": "134f1446",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 4 | \n",
" 2 | \n",
" 9 | \n",
" 2 | \n",
"
\n",
" \n",
" b | \n",
" 5 | \n",
" 8 | \n",
" 4 | \n",
" 7 | \n",
"
\n",
" \n",
" c | \n",
" 6 | \n",
" 9 | \n",
" 9 | \n",
" 7 | \n",
"
\n",
" \n",
" d | \n",
" 2 | \n",
" 6 | \n",
" 2 | \n",
" 4 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C D\n",
"a 4 2 9 2\n",
"b 5 8 4 7\n",
"c 6 9 9 7\n",
"d 2 6 2 4"
]
},
"execution_count": 92,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2 = df.sort_index(axis=0).sort_index(axis=1)\n",
"df2"
]
},
{
"cell_type": "markdown",
"id": "computational-rochester",
"metadata": {},
"source": [
"## Sorting by values\n",
"Another way to sort a DataFrame is to sort the rows or columns according\n",
"to their contained values. Sorting by values is done by the sort_values()\n",
"method. In order to work, this method needs a mandatory parameter that\n",
"must be explicitly set. The `by parameter` must be labeled columns,\n",
"according to which, the values within the column will be sorted in the\n",
"ascending order, as shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": 93,
"id": "aba79a83",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" C | \n",
" D | \n",
" A | \n",
" B | \n",
"
\n",
" \n",
" \n",
" \n",
" d | \n",
" 2 | \n",
" 4 | \n",
" 2 | \n",
" 6 | \n",
"
\n",
" \n",
" a | \n",
" 9 | \n",
" 2 | \n",
" 4 | \n",
" 2 | \n",
"
\n",
" \n",
" b | \n",
" 4 | \n",
" 7 | \n",
" 5 | \n",
" 8 | \n",
"
\n",
" \n",
" c | \n",
" 9 | \n",
" 7 | \n",
" 6 | \n",
" 9 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" C D A B\n",
"d 2 4 2 6\n",
"a 9 2 4 2\n",
"b 4 7 5 8\n",
"c 9 7 6 9"
]
},
"execution_count": 93,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2 = df.sort_values(by='A')\n",
"df2"
]
},
{
"cell_type": "markdown",
"id": "assured-masters",
"metadata": {},
"source": [
"As can be seen from the result, the DataFrame has the values of column A\n",
"in the ascending order. Consequently, the order of the corresponding rows\n",
"of the DataFrame has also changed to follow that order.\n",
"The by parameter can also accept a list of column labels. The\n",
"`sort_values()` method will first sort the column values corresponding to\n",
"the first label in the list. Then, in the context of the equal values present in\n",
"this column, it will sort on the column corresponding to the second label in\n",
"the list, and so on, as shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": 94,
"id": "4b6b4a2f",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" C D A B\n",
"d 2 4 0 6\n",
"b 4 7 0 8\n",
"a 9 2 2 2\n",
"c 9 7 2 9\n"
]
}
],
"source": [
"df['A'] = [0,0,2,2]\n",
"df2 = df.sort_values(by=['A','B'])\n",
"print(df2)"
]
},
{
"cell_type": "markdown",
"id": "working-memory",
"metadata": {},
"source": [
"As we can see from the result, the values of column B will be ordered\n",
"limited to the range of identical values present in column A. Although, in\n",
"this simple example, it may be trivial, this combined sorting will be very\n",
"useful in DataFrame, where there are many rows with data on some\n",
"columns whose values repeat noticeably.\n",
"\n",
"Finally, both the `sort_index()` method and the `sort_values()` method will\n",
"not modify the original DataFrame but will return an ordered copy of it.\n",
"Any changes made to the latter will not affect the original DataFrame in the\n",
"least, as shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": 96,
"id": "f313a8b7",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" C | \n",
" D | \n",
" A | \n",
" B | \n",
"
\n",
" \n",
" \n",
" \n",
" b | \n",
" 4 | \n",
" 7 | \n",
" 0 | \n",
" 8 | \n",
"
\n",
" \n",
" d | \n",
" 2 | \n",
" 4 | \n",
" 0 | \n",
" 6 | \n",
"
\n",
" \n",
" c | \n",
" 9 | \n",
" 7 | \n",
" 2 | \n",
" 9 | \n",
"
\n",
" \n",
" a | \n",
" 9 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" C D A B\n",
"b 4 7 0 8\n",
"d 2 4 0 6\n",
"c 9 7 2 9\n",
"a 9 2 2 2"
]
},
"execution_count": 96,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2 = df.sort_index(axis=0)\n",
"df2.loc['a','A'] = 0\n",
"df\n"
]
},
{
"cell_type": "code",
"execution_count": 97,
"id": "compact-appraisal",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" C | \n",
" D | \n",
" A | \n",
" B | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 9 | \n",
" 2 | \n",
" 0 | \n",
" 2 | \n",
"
\n",
" \n",
" b | \n",
" 4 | \n",
" 7 | \n",
" 0 | \n",
" 8 | \n",
"
\n",
" \n",
" c | \n",
" 9 | \n",
" 7 | \n",
" 2 | \n",
" 9 | \n",
"
\n",
" \n",
" d | \n",
" 2 | \n",
" 4 | \n",
" 0 | \n",
" 6 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" C D A B\n",
"a 9 2 0 2\n",
"b 4 7 0 8\n",
"c 9 7 2 9\n",
"d 2 4 0 6"
]
},
"execution_count": 97,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2"
]
},
{
"cell_type": "markdown",
"id": "olive-bubble",
"metadata": {},
"source": [
"So, if we want our changes to be final, it will be necessary to assign the\n",
"DataFrame returned by the function to the starting DataFrame, as shown as\n",
"follows:"
]
},
{
"cell_type": "code",
"execution_count": 104,
"id": "4e8ffb99",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" C D A B\n",
"b 4 5 0 8\n",
"d 9 5 0 5\n",
"c 8 3 2 5\n",
"a 5 5 2 2\n"
]
}
],
"source": [
"df = df.sort_values(by='A')\n",
"print(df)"
]
},
{
"cell_type": "markdown",
"id": "a158209b",
"metadata": {},
"source": [
"## Reindexing"
]
},
{
"cell_type": "markdown",
"id": "decimal-boards",
"metadata": {},
"source": [
"Reindexing is the fundamental method of aligning data in Pandas.\n",
"Reindexing means conforming the data to match a given set of labels along\n",
"a particular axis. This involves a set of things, which are as follows:\n",
"\n",
"* Reordering of existing data to respond to the new set of labels\n",
"* The insertion of NaN in the positions where there is no\n",
"correspondence\n",
"* Possibility of filling in cases of missing data (NaN) with special data\n",
"following appropriate rules"
]
},
{
"cell_type": "markdown",
"id": "improving-notification",
"metadata": {},
"source": [
"Refer to the following:"
]
},
{
"cell_type": "code",
"execution_count": 98,
"id": "45b8eeef",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" C | \n",
" D | \n",
" A | \n",
" B | \n",
"
\n",
" \n",
" \n",
" \n",
" b | \n",
" 7 | \n",
" 3 | \n",
" 8 | \n",
" 3 | \n",
"
\n",
" \n",
" d | \n",
" 3 | \n",
" 2 | \n",
" 8 | \n",
" 0 | \n",
"
\n",
" \n",
" c | \n",
" 1 | \n",
" 2 | \n",
" 6 | \n",
" 6 | \n",
"
\n",
" \n",
" a | \n",
" 3 | \n",
" 1 | \n",
" 2 | \n",
" 4 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" C D A B\n",
"b 7 3 8 3\n",
"d 3 2 8 0\n",
"c 1 2 6 6\n",
"a 3 1 2 4"
]
},
"execution_count": 98,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame(np.random.randint(10, size=16).reshape(4,4), \n",
" index=['b','d','c','a'], \n",
" columns=['C','D','A','B'])\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "verified-insert",
"metadata": {},
"source": [
"To reindex, use the `reindex()` method specifying the new order of the\n",
"labels that the DataFrame must have. The method will not operate on the\n",
"reference DataFrame, but will return a copy of the DataFrame that will\n",
"follow the new indexing rules. To select which particular Index object will\n",
"be reindexed, we will add the axis parameter with its specific value as the\n",
"function argument, as shown as follows:\n",
" \n",
"* axis = 0 to edit rows (index)\n",
"* axis = 1 to edit columns (columns)"
]
},
{
"cell_type": "code",
"execution_count": 100,
"id": "eac79df5",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" C | \n",
" D | \n",
" A | \n",
" B | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 3 | \n",
" 1 | \n",
" 2 | \n",
" 4 | \n",
"
\n",
" \n",
" b | \n",
" 7 | \n",
" 3 | \n",
" 8 | \n",
" 3 | \n",
"
\n",
" \n",
" c | \n",
" 1 | \n",
" 2 | \n",
" 6 | \n",
" 6 | \n",
"
\n",
" \n",
" d | \n",
" 3 | \n",
" 2 | \n",
" 8 | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" C D A B\n",
"a 3 1 2 4\n",
"b 7 3 8 3\n",
"c 1 2 6 6\n",
"d 3 2 8 0"
]
},
"execution_count": 100,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2 = df.reindex(['a','b','c','d'],axis=0)\n",
"df2"
]
},
{
"cell_type": "markdown",
"id": "armed-plenty",
"metadata": {},
"source": [
"As you can see, the order of the rows of the DataFrame has been completely\n",
"changed, following the order in the list of labels specified as an argument of\n",
"the method.\n",
"\n",
"The new DataFrame will be a completely independent copy, and any\n",
"changes made to it will not affect the original one in the least, as shown as\n",
"follows:"
]
},
{
"cell_type": "code",
"execution_count": 102,
"id": "ead02486",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" C | \n",
" D | \n",
" A | \n",
" B | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 3 | \n",
" 1 | \n",
" 2 | \n",
" 4 | \n",
"
\n",
" \n",
" b | \n",
" 7 | \n",
" 0 | \n",
" 8 | \n",
" 3 | \n",
"
\n",
" \n",
" c | \n",
" 1 | \n",
" 2 | \n",
" 6 | \n",
" 6 | \n",
"
\n",
" \n",
" d | \n",
" 3 | \n",
" 2 | \n",
" 8 | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" C D A B\n",
"a 3 1 2 4\n",
"b 7 0 8 3\n",
"c 1 2 6 6\n",
"d 3 2 8 0"
]
},
"execution_count": 102,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2.iloc[1,1] = 0\n",
"df2\n"
]
},
{
"cell_type": "code",
"execution_count": 103,
"id": "narrative-annual",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" C | \n",
" D | \n",
" A | \n",
" B | \n",
"
\n",
" \n",
" \n",
" \n",
" b | \n",
" 7 | \n",
" 3 | \n",
" 8 | \n",
" 3 | \n",
"
\n",
" \n",
" d | \n",
" 3 | \n",
" 2 | \n",
" 8 | \n",
" 0 | \n",
"
\n",
" \n",
" c | \n",
" 1 | \n",
" 2 | \n",
" 6 | \n",
" 6 | \n",
"
\n",
" \n",
" a | \n",
" 3 | \n",
" 1 | \n",
" 2 | \n",
" 4 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" C D A B\n",
"b 7 3 8 3\n",
"d 3 2 8 0\n",
"c 1 2 6 6\n",
"a 3 1 2 4"
]
},
"execution_count": 103,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "markdown",
"id": "overall-accounting",
"metadata": {},
"source": [
"But what happens if we insert labels that are not present in our DataFrame?\n",
"Or, if we omit some present label? Refer to the following:"
]
},
{
"cell_type": "code",
"execution_count": 104,
"id": "7b5196e9",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" b | \n",
" 8 | \n",
" 3 | \n",
" 7 | \n",
" 3 | \n",
"
\n",
" \n",
" d | \n",
" 8 | \n",
" 0 | \n",
" 3 | \n",
" 2 | \n",
"
\n",
" \n",
" c | \n",
" 6 | \n",
" 6 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" a | \n",
" 2 | \n",
" 4 | \n",
" 3 | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C D\n",
"b 8 3 7 3\n",
"d 8 0 3 2\n",
"c 6 6 1 2\n",
"a 2 4 3 1"
]
},
"execution_count": 104,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2 = df.reindex(['A','B','C','D'],axis=1)\n",
"df2"
]
},
{
"cell_type": "markdown",
"id": "foster-hebrew",
"metadata": {},
"source": [
"As we can see, if a new label is inserted in the reindexing, the method will\n",
"create a new column (or row) containing NaN (missing values) as values.\n",
"While if you do not report an existing label in the reindexing, the column\n",
"(or row) will be removed in the new returned DataFrame."
]
},
{
"cell_type": "code",
"execution_count": 105,
"id": "59d52878",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" E | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" b | \n",
" 8 | \n",
" NaN | \n",
" 7 | \n",
" 3 | \n",
"
\n",
" \n",
" d | \n",
" 8 | \n",
" NaN | \n",
" 3 | \n",
" 2 | \n",
"
\n",
" \n",
" c | \n",
" 6 | \n",
" NaN | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" a | \n",
" 2 | \n",
" NaN | \n",
" 3 | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A E C D\n",
"b 8 NaN 7 3\n",
"d 8 NaN 3 2\n",
"c 6 NaN 1 2\n",
"a 2 NaN 3 1"
]
},
"execution_count": 105,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2 = df.reindex(['A','E','C','D',],axis=1)\n",
"df2"
]
},
{
"cell_type": "markdown",
"id": "widespread-introduction",
"metadata": {},
"source": [
"### Reindexing using another DataFrame as a reference\n",
"There is also the possibility of using a reference DataFrame from which to\n",
"take the order of the indexes (both by row and by column) and apply it to\n",
"our DataFrame. This is possible with the `reindex_like()` method. This\n",
"method will also not change the object that it is called on but will return a\n",
"copy of the DataFrame as a result. In this regard, we create a reference DataFrame, as shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": 106,
"id": "e88b3773",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" b | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" c | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" d | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C D\n",
"a 0.0 0.0 0.0 0.0\n",
"b 0.0 0.0 0.0 0.0\n",
"c 0.0 0.0 0.0 0.0\n",
"d 0.0 0.0 0.0 0.0"
]
},
"execution_count": 106,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dfo = pd.DataFrame(np.zeros(16).reshape(4,4), \n",
" index=['a','b','c','d'], \n",
" columns=['A','B','C','D'])\n",
"dfo"
]
},
{
"cell_type": "markdown",
"id": "satisfactory-collect",
"metadata": {},
"source": [
"We will apply the `reindex_like()` method on our DataFrame on which we\n",
"want to reindex, as shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": 107,
"id": "1dfa27c1",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 2 | \n",
" 4 | \n",
" 3 | \n",
" 1 | \n",
"
\n",
" \n",
" b | \n",
" 8 | \n",
" 3 | \n",
" 7 | \n",
" 3 | \n",
"
\n",
" \n",
" c | \n",
" 6 | \n",
" 6 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" d | \n",
" 8 | \n",
" 0 | \n",
" 3 | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C D\n",
"a 2 4 3 1\n",
"b 8 3 7 3\n",
"c 6 6 1 2\n",
"d 8 0 3 2"
]
},
"execution_count": 107,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2 = df.reindex_like(dfo)\n",
"df2"
]
},
{
"cell_type": "markdown",
"id": "becoming-stretch",
"metadata": {},
"source": [
"Also in this case, if in the reference DataFrame, there are labels that do not\n",
"exist in our DataFrame on which to index, these rows or columns will be\n",
"added to the new DataFrame returned with all the NaN values (missing\n",
"values) inside. Whereas, if in the reference DataFrame, some labels are not\n",
"present, the corresponding rows or columns will be removed from the new\n",
"DataFrame returned, as shown as follows:"
]
},
{
"cell_type": "code",
"execution_count": 108,
"id": "e6326a06",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" W | \n",
" Z | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" b | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" c | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" e | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B W Z\n",
"a 0.0 0.0 0.0 0.0\n",
"b 0.0 0.0 0.0 0.0\n",
"c 0.0 0.0 0.0 0.0\n",
"e 0.0 0.0 0.0 0.0"
]
},
"execution_count": 108,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dfo = pd.DataFrame(np.zeros(16).reshape(4,4), \n",
" index=['a','b','c','e'], \n",
" columns=['A','B','W','Z'])\n",
"dfo"
]
},
{
"cell_type": "code",
"execution_count": 109,
"id": "9c2b8e1f",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" W | \n",
" Z | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 2.0 | \n",
" 4.0 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" b | \n",
" 8.0 | \n",
" 3.0 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" c | \n",
" 6.0 | \n",
" 6.0 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" e | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B W Z\n",
"a 2.0 4.0 NaN NaN\n",
"b 8.0 3.0 NaN NaN\n",
"c 6.0 6.0 NaN NaN\n",
"e NaN NaN NaN NaN"
]
},
"execution_count": 109,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2 = df.reindex_like(dfo)\n",
"df2"
]
},
{
"cell_type": "markdown",
"id": "exciting-disease",
"metadata": {},
"source": [
"## Conclusion:"
]
},
{
"cell_type": "markdown",
"id": "subtle-jacket",
"metadata": {},
"source": [
"In this chapter, we dealt with a whole series of basic operations that apply to\n",
"DataFrame (and also to Series) and which greatly help in understanding this\n",
"type of object and how they work. The concept of selection is very\n",
"important because it allows us to focus our attention on a subset of the\n",
"DataFrame. This type of object often has dimensions that are too large to be\n",
"considered in their entirety, and in addition to the often-insufficient\n",
"selection, a filtering system that allows us to select only what meets\n",
"particular requirements that come to our aid."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c786c045",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.2"
}
},
"nbformat": 4,
"nbformat_minor": 5
}